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Abstract 

Recently, Andrieu, Doucet and Holenstein [l] introduced a general 
framework for using particle filters (PFs) to construct proposal kernels for 
Markov chain Monte Carlo (MCMC) methods. This framework, termed 
Particle Markov chain Monte Carlo (PMCMC), was shown to provide 
powerful methods for joint Bayesian state and parameter inference in 
nonlinear/non-Gaussian state-space models. However, the mixing of the 
resulting MCMC kernels can be quite sensitive, both to the number of 
particles used in the underlying PF and to the number of observations in 
the data. In the discussion following [l], Whiteley [2] suggested a mod- 
ified version of one of the PMCMC samplers, namely the particle Gibbs 
(PC) sampler, and argued that this should improve its mixing. In this 
paper we explore the consequences of this modification and show that it 
leads to a method which is much more robust to a low number of particles 
as well as a large number of observations. Furthermore, we discuss how 
the modified PC sampler can be used as a basis for alternatives to all 
three PMCMC samplers derived in [T]. We evaluate these methods on 
several challenging inference problems in a simulation study. One of these 
is the identification of an epidemiological model for predicting infiuenza 
epidemics, based on search engine query data. 

1 Introduction 

Particle Markov chain Monte Carlo (PMCMC) is an umbrella for a collection 
of methods introduced in [l]. The fundamental idea underlying these methods, 
is to use a sequential Monte Carlo (SMC) sampler, i.e. a particle filter (PF), to 
construct a proposal kernel for an MCMC sampler. The resulting methods were 
shown to be powerful tools for joint Bayesian parameter and state inference 
in nonlinear, non-Gaussian state-space models. However, to obtain reasonable 
mixing of the resulting MCMC kernels, it was reported that a fairly high number 
of particles was required in the underlying SMC samplers. This might seem like 
a very natural observation; in order to obtain a good proposal kernel based on a 
PF, we should intuitively use sufficiently many particles to obtain high accuracy 



in the PF. However, this is also one of the major criticisms against PMCMC. 
If wc need to run a PF with a high number of particles at each iteration of the 
PMCMC sampler, then a lot of computational resources will be wasted and the 
method will be very computationally intense (or "computationally brutal" as 
Flury and Shephard put it in the discussion following [l]). It is the purpose of 
this work to show that this need not be the case. We will discuss alternatives to 
each of the three PMCMC methods derived in [l], that will function properly 
even when using very few particles. The basic idea underlying these modified 
PMCMC samplers was originally proposed by Whiteley ^ . 

To formulate the problem that we are concerned with, consider a general, 
discrete-time state-space model with state-space X, parameterised by € 9, 

xt+i ~ fei.Xt+1 I xt), (la) 
Vt ~ geivt I xt). (lb) 

The initial state has density 7re(a;i). We take a Bayesian viewpoint and model 
the parameter as a random variable with prior density p{0). Given a sequence of 
observations yi;T = {2/1, • ■ ■ , 2/t}, we wish to estimate the parameter as well 
as the system state xi-^- That is, we seek the posterior density p{6, xi;t \ TJi-.t)- 
During the last two decades or so, Monte Carlo methods for state and pa- 
rameter inference in this type of nonlinear, non-Gaussian state-space models 
have appeared at an increasing rate and with increasingly better performance. 
State inference via SMC is thoroughly treated by for instance (4||6]. For the 
problem of parameter inference, the two overview papers [Tjjsj provide a good 
introduction to both frequentistic and Bayesian methods. Some recent results 
on Monte Carlo approaches to maximum likelihood parameter inference in state- 
space models can be found in [9 - 12 . Existing methods based on PMCMC will 



be discussed in the next section, where we also provide a preview of the material 
presented in the present work. 



2 An overview of PMCMC methods 

In [1], three PMCMC methods were introduced to address the inference problem 
mentioned above. These methods are referred to as particle Gibbs (PG), particle 
marginal Metropolis-Hastings (PMMH) and particle independent Metropolis- 
Hastings (PIMH). 

Let us start by considering the PG sampler, which is an MCMC method 
targeting the joint density ^(6*, a;i:T | Vi-.t)- In an "idealised" Gibbs sampler (see 



e.g. 13 for an introduction to Gibbs sampling), we would target this density 



by the following two-step sweep, 

• Draw 9* \ xi.t ^ p{6 \ xi^t, Vi-.t)- 

• Draw x\.j, \ 9* ^ pe*[xi;T \ Vi-.t)- 

The first step of this procedure can, for some problems be carried out exactly 
(basically if we use conjugate priors for the model under study). In the PG 
sampler, this is assumed to be the case. However, the second step, i.e. to sample 
from the joint smoothing density pe{xi.,T \ Vi-.t), is in most cases very difficult. 
In the PG sampler, this is addressed by instead sampling a particle trajectory 
x\.rp using a PF. More precisely, we run a PF targeting the joint smoothing 
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density. We then sample one of the particles at the final time T, according 
to their importance weights, and trace the ancestral lineage of this particle to 
obtain the trajectory x\.rp. However, as we shall see in Section 4.2 this can lead 
to very poor mixing when there is degeneracy in the PF (see e.g. ^ 14 for a 
discussion on PF degeneracy). This will inevitably be the case when T is large 
and/or N is small. 

One way to remedy this problem is to append a backward simulator to the 
PG sampler, leading to a method that we denote particle Gibbs with backward 
simulation (PG-BSi). The idea of using backward simulation in the PG context 
was mentioned by Whiteley [2] in the discussion following [I]. To make this 
paper self contained, we will present a derivation of the PG-BSi sampler in 
Section[4j The reason for why PG-BSi can avoid the poor mixing of the original 



PG sampler will be discussed in Section 4.3 Furthermore, in Section |4.2| we 



shall see that the PG-BSi sampler can operate properly even with very few 
particles and vastly outperform PG in such cases. 

Now, as mentioned above, to apply the PG and the PG-BSi samplers we need 
to sample from the conditional parameter density p(9 \ Xi;T,yi:T)- This is not 
always possible, since conjugate priors can be both undesirable and unavailable 
(e.g. if there is a nonlinear parameter dependence in the model). For such 
cases, the PMMH of [l] can be used. Similarly to PG, this method targets the 
joint posterior density p{9,Xi;T \ Ui-.t), but parameter values are drawn from 
some proposal kernel q(0 \ 0') in a Metropolis-Hastings (MH) fashion. Hence, 
the method does not require the density p{9 \ xi:t,J/1:t) to be available for 
sampling. 

In this work, we propose an alternative method called Metropolis within 
particle Gibbs (MwPG), derived in Section [5] The MwPG sampler is based on 
the PG-BSi sampler, but instead of sampling exactly from p{9 \ Xi-TtVi-.t) we 
use an MH step to sample the parameters. This might at first seem superfiuous 
and appear as a poor imitation of the PMMH sampler. However, as we shall see 
in Section 5.2 the MwPG sampler can vastly outperform PMMH, even when 
the two methods use the exact same proposal density to sample the parameters. 
The reason is that PMMH requires accurate estimates of the likelihood peiyi-.r) 
to get a high acceptance probability [I]. This implies that also this method 
requires a high number of particles in the PF. On the contrary, as mentioned 
above, the PG-BSi sampler has the ability to function properly even for a very 
low number of particles. Since they share a common base, this desirable property 
of the PG-BSi sampler will "rub off" on the MwPG sampler. 

Finally, the PIMH introduced in (T] is as a method targeting the joint 
smoothing density pb(xi;t \ Ui-.t) when the parameter 6 is assumed to be 
known. Hence, it is a competitor to existing particle smoothing algorithms, 
see e.g. 14 ■ 17 . The PIMH can be seen as a collapsed version of the PMMH, 
and will sufii'er from the same problem when we decrease the number of par- 
ticles. To mitigate this, we propose to instead use a collapsed version of the 
PG-BSi sampler. The details are given in Section |6] and the proposed method 
is compared with the PIMH and a state of the art particle smoother 16 in 
Section 16.31 



Since the publication of the seminal paper [l], several contributions have 
been made which relate to the present work. We have published a preliminary 



version of the PG-BSi sampler in 18 . We have also applied this to the problem 



of Wiener system identification in 19 . In 20 , a version of PG with backward 
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simulation is used for inference in switching linear dynamical systems. In this 
work, it was observed that the backward sampling step could indeed drastically 
improve the performance of the PG sampler. In [s] a similar approach is used 
for multiple change-point models. 

Backward simulation has also been used in the context of PIMH and PMMH 



21 . However, it should be noted that their approach is fundamentally different 



from the MwPG sampler suggested in this work. In [21], the aim is to "improve 
the quality" of the sampled state trajectories. The acceptance probabilities in 



the MH samplers suggested in 21 are identical to those used in the original 
PMMH and PIMH samplers. Hence, they will suffer from the same problems 
when using few particles. 



3 Particle filter and backward simulator 

Before we go in to the details of the PG-BSi method in Section |4j we review the 
PF and the backward simulator. Throughout this section we assume that the 
parameter 9 is fixed and discuss how to approximately sample from the joint 
smoothing density pe{xi;T \ Vi-.t)- 

3.1 Auxiliary particle filter 

The auxiliary particle filter (APF) [22] can be used to approximate (and ap- 
proximately sample from) the joint smoothing distribution. The reason for why 
we choose an APF, rather than a "standard" PF, is that it is more general and 
that the notion of auxiliary variables to determine ancestor indices is well suited 
for the derivation of the PG-BSi sampler in Section [4] 

We shall assume that the reader is familiar with the APF and make the pre- 
sentation quite brief, mainly just to introduce the required notation. For readers 
not famihar with the PF or the APF, we refer to [4]|5][22]. Let J^^^ 
be a weighted particle system targeting pe(xi;t-i \ yi-.t-i)- That is, the particle 
system defines an empirical distribution, 

which approximates the target distribution. In the APF, we propagate this 
sample to time t by sampling from a proposal kernel. 

Here, the variable it is the index to an "ancestor particle" xl'_i and is a 
proposal kernel which proposes a new particle at time t given this ancestor. 
The factors = i'f_i{xl'_^,yt), known as adjustment multiplier weights, are 
used in the APF to increase the probability of sampling ancestors that better 
can describe the current observation. If these adjustment weights are identically 
equal to 1, we recover the standard PF. 

Once we have generated N new particles (and ancestor indices) from the 
kernel , the particles are assigned importance weights according to — 
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(x™, for TO = 1, . . . , TV, where the weight function is given by, 

W\xt xt i) = I xt)fe{xt \ xt-i) 

This results in a new weighted particle system {a;™^, w™}^^]^, targeting the 
joint smoothing density at time t. 

The method is initialised by sampling from a proposal density ~ Ri{xi) 
and weighting the samples according to = Wf{x™) where the weight func- 
tion is given by, W^{xi) = ge{yi \ xi)7re(xi)/i??(a;i). 

After a complete run of the APF we can sample a trajectory from the empir- 
ical joint smoothing density by choosing particle with probability propor- 
tional to w™. Note that this particle trajectory consists of the ancestral lineage 
of . Hence, if this lineage is defined by the indices bi-x (with bx = to) we 
obtain a sample 

■^I'T" "^IT^ 1 ■ • ■ 7 Xrp ^. 

For later reference, we note that the proposal kernel used in the APF is not 
allowed to be completely arbitrary. In principle, we need to make sure that the 
support of the proposal covers the support of the target. Hence, we make the 
following assumption. 

Assumption 1. Let S = {9 G & : p{9) > 0}. Then, for any 9 (1 S and any 
t G {1, . . . , T}, Sf C Qf where we have defined, 

= {xi..t G X* : pe{xi..t \ Vi-.t) > 0}, 
Qt = {xi:t e X* : R^ixt I xt-i)pg{xi.,t^i I yi.,t-i) > 0}. 

3.2 Backward simulation 

In the original PG sampler by [l], sample trajectories are generated as in 
However, due to the degeneracy of the APF, the resulting Gibbs sampler can 
suffer from poor mixing. We will illustrate this in Section |4.2[ As previously 
mentioned, we will focus on an alternative way to generate an approximate sam- 
ple trajectory from the joint smoothing density, known as backward simulation. 
It was introduced as a particle smoother in [14 23 . In [2|, it was suggested as 



a possible way to increase the mixing rate of a PG kernel. 
Consider a factorisation of the joint smoothing density as, 



T-l 



Pe{xi:T I yi-.r) = Pe{xT \ yi-.r) Y\_ Psi^t \ xt+i,yi:t)- 

Here, the backward kernel density can be written as, 

/ I N fe{xt+i I xt)pe{xt I yi:t) 

Pe[xt I xt+i,yi:t) = 7 . ^ ■ 

Pe{xt+i I yi:tj 

We note that the backward kernel depends on the filtering density pg{xt \ yi-.t)- 
The key enabler of the backward simulator is that this density (in many cases) 
can be readily approximated by an (A)PF, without suffering from degeneracy. 
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Algorithm 1 Backward simulator 



1. Initialise: Set jx — m with probability w^/J^i ^t- 

2. For < = T - 1 : -1 : 1 do: 

(a) Given xl*^^ , compute the smoothing weights, 

Wf\rp = : , m—1, . . . , N. 



(b) Set jt = m with probability w^rp. 



By using the APF to approximate the filtering density, we obtain an approxi- 
mation of the backward kernel, 

Pe [dxt xt+i,yi,t) = 2^ = — — (T^rit^^Jt . 

^^T.iwlf9{xt+i\x[) 

Based on this approximation, we may sample a particle trajectory backward in 
time, approximately distributed according to the joint smoothing density. The 
procedure is given in Algorithm [ij The algorithm generates a set of indices ji-T 
defining a backward trajectory according to 

x\.,^ = {xl\...,x^}. (4) 

Note that, since we only need to generate a single backward trajectory, the 
computational cost of the backward simulator is 0{N), i.e. of the same order 
as the APF. Hence, the computational complexity of the PG sampler is more 
or less unaffected by the introduction of the backward simulation step. Still, it 
is possible to reduce the complexity of the backward simulator even further by 
making use of rejection sampling |16) . 



4 Particle Gibbs with backward simulation 

In this section we will derive the PG-BSi sampler and discuss its properties. 
This forms the basis for all the methods proposed in this work. 



4.1 The PG-BSi sampler 

Recall the idealised Gibbs sampler, briefly outlined in Section [2] The two steps 
of the method which are performed at each iteration are, 

9* \ Xl:T ^ P{6 \ Xl:T,yi:T), (5a) 

x{:T I 0* Pe* [xi-.T I yi-.r)- (5b) 
As previously mentioned, the idea behind the PG sampler by [l] is, instead 



of sampling according to (5b), to run an APF to generate a sample trajectory 
as m In the PG-BSi sampler, we instead generate a sample trajectory 



■-l-.T 



by backward simulation, as in Q. However, in either case, to simply replace 
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step (5b) of the idealised Gibbs sampler with such an approximate sampling 
procedure will not result in a valid approach. In [l] they solve this by replacing 
the APF with a so called conditional APF (CAPF), and show that the resulting 
PG sampler is a valid MCMC method. In this section, we make a similar 
derivation for the PG-BSi sampler. 

The key in deriving the PG-BSi sampler is to consider a Markov chain in 
which the state consists of all the random variables generated by the particle 
filter and the backward simulator. Hence, the state of the Markov chain will not 
only be the particle trajectory , . . . , x^, but the complete set of particles, 
particle indices and backward trajectory indices. For this cause, let us start by 
determining the density of all the random variables generated by the APF. Let 
xt = {xl, . . . , x^} and U — {il, . . . , i^} be the particles and ancestor indices, 
respectively, generated by the APF at time t. Note that zj" is (the index of) 
the ancestor of particle xj" . For a fixed 9 the APF then samples from, 

N T N 

v'(xi:T,i2:T) =n^?(^i)n n ^'(C.^D- 

1=1 t=2m=l 

Note that the APF is simply a method which generates a single sample from 
on the extended space X^"^ x {1, . . . , N}^^'^-'^\ It can be beneficial to 
adopt this view on the APF in order to understand the inner workings of the 
PMCMC samplers. 

By completing this with a backward simulation according to Algorithm [T] 
we generate the indices ji, . . . , jr defining a backward trajectory. The joint 
density of {xi.y, i2:T, Ji:t} is then given by, 

. . , ,0f . , < ^ w^^fe{xr\x^^) 

V' (X1:T,12:T,J1:T)-V (xi:T, 12:t) X | I (6) 

Thus, the state trajectory x'J^gT = 2;^} will be distributed (jointly 

with the random indices ji-.r) according to a marginal of ([6]), ■(/'''^(a^i-T 'ii:?")' 
which in general is not equal to the joint smoothing density. In other words, 
the extracted trajectory tC -j^ , rj-\ IS not distributed according to pg{xi:T \ Ui-.t) and 



to simply use this trajectory in step (5b) of the idealised Gibbs sampler would 
not be a valid approach. 

The key to circumventing this problem is to introduce a new, artificial target 
density, with two properties: 

1. The artificial target density should admit the joint smoothing density as 
one of its marginals. 

2. The artificial target density should be "similar in nature" to '0^, to enable 
the application of an APF in the sampling procedure. 

Here, we define this artificial target density according to, 
Ala ^ . . X A P{0.x{];^ I yi-.r) 

nO,Xi,T,l2:T,Jl:T) = 

^ V{-^l:T,h:T) TT- w/^Jejxj' \ X^Ll) 
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Algorithm 2 CAPF - conditional auxiliary particle filter (conditioned on 
{x[,tJi:t}) 

1. Initialise: 

(a) Sample ^ Ri{xi) for m 7^ ji and set = x\. 

(b) Set = Wl{x^) for m = 1, . . . , A^. 

2. For < = 2, . . . , T do: 

(a) Sample {z™, x™} ~ M^(it,xt) for m ^ jt and set xf = x^. 

(b) Sample ij' according to, 

P(zf = m = ; : . 

(c) Set u;™ = xj" J for to = 1, . . . , A^. 



This construction is due to 21 , who apply backward simulation in the context 
of PMMH and PIMH. The artificial target density ([?]) differs fr om the one used 
in [1] in the last factor. 

One of the important properties of the density is formalised in the following 
proposition. 

Proposition 1. The marginal density 0/ {0, xj^^ , ji:^} under cf) is given by, 

<P{o,x'^:^,h..T) 



ji:T ■ ^ _ x^j!: I yi..T) 



Furthermore, with {0, Xi:T, i2:T, ii:T} being a sample distributed according to cj), 
the density o/{0,a;j!^} is given by p{0,x-[]i^ \ yi-.r)- 

Proof. See [2lj Theorem 1]. □ 

This proposition has the following important implication; if we can construct 
an MCMC method with stationary distribution cj), then the sub-chain given by 
the variables will have stationary distribution p{9,xi;T \ Ui-.t)- 

The construction of such an MCMC sampler is enabled by the second impor- 
tant property of our artificial density, namely that it is constructed in such a way 
that we are able to sample from the conditional 0(x^^^'^, i2:T | S,x-'-^.'^ ,ji;T)- 
Here we have introduced the notation x^^^-^ = {x^-'^ . . . , x^-*^} and x^-*' — 
{x\, . . . , Xj"**^ ^, a;|"'*''^^, . . . , x^}. This can be done by running a so called con- 
ditional APF (CAPF), introduced in This method is given in Algorithm [2] 
and its relationship to (f) is formalised in Proposition [2] 



^ The CAPF used in this work is sUghtly different from the conditional SMC introduced 
in The difference lies in step 2(b) - where we sample new ancestor indices, they simply set 
i(* = jt+i- The reason for this difference is that the indices ji-j- have different interpretations 
in the two methods. Here they correspond to backward trajectory indices, whereas in the 
original PC sampler they correspond to an ancestral path. 
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Proposition 2. The conditional (/)(x-^.^'^ , i2:T | ji:t) under 4> is given 

by, 

i/'''(xi:T,i2:T) -Q u't-i/eCif I 



Consequently, the CAPF given in Algorithm^ conditioned on {0,x-[]';^ ^ji-x}, 
generates a sample from this conditional distribution. 

Proof. See Appendix [A] □ 

Note that the conditional density considered in Proposition [2] only depends 
on known quantities, as opposed to the density defined in ([7| which depends 
explicitly on the joint posterior xi-t \ Ui-.t) (which in general is not known). 

As a final component of the PG-BSi sampler we need a way to generate a 
sample state trajectory. As previously pointed out, we aim to do this by running 
a backward simulator. The fact that the backward simulator indeed generates a 
sample from one of the conditionals of is assessed in the following proposition. 

Proposition 3. The conditional 4>{ji:T \ 0i'^1:T,'^2:t) under <j) is given by, 

Consequently, the backward simulator given in Algorithm [7} conditioned on 
{0, Xi:T, i2:T}; generates a sample from this conditional distribution. 

Proof. Sec Appendix [X] □ 

We are now ready to present the PG-BSi sampler, given in Algorithm [3] 
Based on the propositions and discussion above, the three steps of the method 
[2(a)-2(c)] can be interpreted as: 

(a) Dmw0* ^<j>{0\x{]^,ji..T), 

(b) Draw x*:^-''^^ , ij^^ - (j){x~fr^'-'^ ,i2:T \ 9*,x{]';^ Ji-.t), 

(c) Draw - (j}{ji.,T \ 6*^, x*'"^^^^, i* 

Hence, each step of Algorithm |3] is a standard Gibbs update, targeting the 
distribution (f>, and will thus leave (j> invarianl[^ It follows that (/> is a stationary 
distribution of the PG-BSi sampler. 

It is worth to note that the sampling scheme outlined above is not a "com- 
plete" multistage Gibbs sweep, since it does not loop over all the variables of the 
model. More precisely, we do not sample new locations for the particles x^^rf in 
any of the steps above (the same is true also for the original PG sampler). This 
might raise some doubts about ergodicity of the resulting Markov chain. How- 
ever, as established in Theorem [l] below, the PG-BSi sampler is indeed ergodic 

^In step (a) we marginalise over some of the variables before conditioning. This is known 
in the literature as collapsed Gibbs (see e.g. [13| Sec. 6.7]) and leaves the target distribution 
invariant. 
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Algorithm 3 PG-BSi - particle Gibbs w. backward simulation 



1. Initialise: Set 0{O), xi;t{0) and ji:T(0) arbitrarily. 

2. For r > 1, iterate: 

(a) Sample 9{r) - p{e \ xi.rir - 1), yi-.x)- 

(b) Rmi a CAPF targeting Pe{r)ixi-T I 2/it)i conditioned on {xi-xir — 
l),ji:T(r-l)}. 

(c) Run a backward simulator to generate ji-.Ti^)- Set xi-xir) to the corre- 
sponding particle trajectory. 



(as long as the idealised Gibbs sampler is), meaning that it is a valid MCMC 
method. Intuitively, the reason for this is that the collection of variables that is 
"left out" is chosen randomly at each iteration. 

Assumption 2. The idealised Gibbs sampler, defined by drawing alternately 
from p{9 I xi:T,yi:T) and pe{xi;T \ yi-.r) is p-irreducible and aperiodic (and 
hence converges for p- almost all starting points). 

Theorem 1. For any number of particles N >2, 

1. the PG-BSi sampler of Algorithm^defines a transition kernel on the space 
0xX^^x{l, Af}^(^^i)+^ with invariant density 4){6,xi:T,i2:T, ji-.r)- 

2. Additionally, if Assumption [7] and Assumption hold, then the PG-BSi 
sampler generates a sequence {9{r),xi-T{r)} whose marginal distribution 
C''{{e{r),x^.,T{r)}e-) satisfy, 

\\C''mr),xMr)} e •) -p{- I yi:T)\\TV^O 

as r — > cx) for p-almost all starting points (\\ ■ \\ tv being the total variation 
norm). 

Proof. Sec Appendix [X] □ 

It should be noted that [2|[20] motivate the use of backward simulation in 
the PG sampler in a slightly different way. They simply note that the backward 
simulator is a Markov kernel with invariant distribution given by the original 
artificial target proposed in [1^. Hence, they view the backward simulator as a 
way to resample the ancestor indices for the extracted trajectory x'[]^ . 



4.2 Numerical illustration 

We have argued that the PG-BSi sampler in Algorithm [3] should be preferable 
over the standard PG sampler in the sense that it produces a Gibbs sampler 
with better mixing properties. Since the PG-BSi sampler is the basis for all the 
methods proposed in this paper, we pause for a while and consider a numerical 
illustration of this improved mixing. Hence, consider the following nonlinear 
state-space model, 

Xt 

xt+i = Pixt + ' 2 +/33Cos(1.2i) + (8a) 

i- + x^ 

Vt ^ 0.05\xtr + eu (8b) 
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Figure 1: Running means for the estimated standard deviations and Ce for the 
PG sampler (left) and PG-BSi sampler (right) for a maximum of 1000 MCMC 
iterations. The four curves correspond to different number of particles. The 
"true" values are shown as thick solid lines. Note that the estimate of CTu does 
not converge to this "true" value, but this is not surprising since we consider 
just one data realisation. Hence, the posterior mean may very well differ from 
the "true" value. 



where xi ~ A/'(0,5), ~ A/'(0,f7^) and et ~ A/'(0,crg); here A^(0,cr^) denotes 
a zero- mean Gaussian with variance cr^. For the time being, we assume that 
{/?!, /32, /Sa, a} = {0.5, 25, 8, 2} are fixed and consider the problem of estimating 
the noise variances (see Section 5.3 for an experiment in which all the parameters 
are identified). Hence, we set 9 = {a^^al}. The same model was used in [l] to 
illustrate the PG and the PMMH samplers. 

We generate a set of observations 2/1:500 according to ^ with = 10 and 
(Tg = 1. The parameter priors are modelled as inverse gamma distributed with 
hyper-parameters a = b = 0.01. We then employ the PG sampler of [l] and the 
PG-BSi sampler of Algorithm[3]to estimate the posterior parameter distribution. 
Since the theoretical validity of the PG-BSi and the PG samplers can be assessed 
for any number of particles N > 2 (see Theorem [T]), it is interesting to see the 
practical implications of using very few particles. Hence, we run the methods 
four times on the same data wfth N = 5, 20, 1000 and 5000 {N = 5000 as 

was used in jlj). The parameters are initialised at 9{0) = (lO lO)^ in all 
experiments. 

In Figure[l]we show the estimated posterior means of the standard deviations 
ay and de (i.e. the square roots of the means of 9(1 : r)) vs. the number of 
MCMC iterations r. The left column shows the results for the PG sampler. 
When we use N — 5000 particles, there is a rapid convergence towards the 
true values, indicating that the method mixes well and quickly finds a region of 
high posterior probability. However, as we decrease the number of particles, the 
convergence is much slower. Even for N = 1000, the method struggles and for 
= 20 and TV = 5 it does not seem to converge at all (in a reasonable amount 
of time). The key observation is that the PG-BSi sampler (right column), on 
the other hand, seems to be more or less unaffected by the large decrease in the 
number of particles. In fact, the PG-BSi using just N = 5 performs equally well 
as the PG sampler using N = 1000 particles. 
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Figure 2: Particle tree generated by the CAPF at iterations r (left) and r + 1 
(right) of the PG sampler. The dots show the particle positions, the thin black 
lines show the ancestral dependence of the particles and the thick black lines 
show the sampled trajectories xi-xir) and xi-x^r + 1), respectively. Note that, 
due to degeneracy of the filter, the particles shown as grey dots are not reachable 
by tracing any of the ancestral lineages from time T and back. 

4.3 Why the big difference? 

To see why the PG-BSi sampler is so much more insensitive to a decreased 
number of particles, consider a toy example where we have generated T — 50 
observations from a 1st order linear system. Figure [2] (left) shows the particle 
tree generated by the CAPF at iteration r of a PG sampler, using = 30 
particles. Due to the degeneracy of the PF, there is only one distinct particle 
trajectory for t < 32. In the PG sampler, we draw xi;T{r) by sampling among 
the particles at time T and tracing the ancestral path of this particle. This 
trajectory is illustrated as a thick black line in the figure. At the next iteration 
of the PG sampler we run a CAPF, conditioned on this particle trajectory. 
This results in the tree shown in Figure [2] (right). Due to de generacy, we once 
again obtain only a single distinct particle trajectory for t < 29. Now, since 
we condition on the trajectory xi;Ti'f), the particle tree generated at iteration 
r + 1 must contain xi:Tir). Hence, all particle trajectories available at iteration 
r + 1 are identical to Xi.2g(r) up to time 29. Consequently, when we sample 
xi-.Ti^ + 1) at iteration r + 1, this trajectory will to a large extent be identical 
to xi:T{r)- This results in a poor exploration of the state space, which in tiun 
means that the Gibbs kernel will mix slowly. 

Based on this argument we also note that the mixing will be particularly 
poor when the degeneracy of the CAPF is severe. This will be the case if the 
length of the data record T is large and/or if the number of particles N is small. 
This is consistent with the results reported in the previous section. 

The reason for why PG-BSi can circumvent the poor mixing of the PG 
kernel is that the backward simulator explores all possible particle combinations 
when generating a particle trajectory, and is thus not constrained to one of the 
ancestral paths. Consider the particles generated by the CAPF at iteration r of a 
PG-BSi sampler, shown in Figure|3](left). A backward trajectory xi:T(f'), shown 
as a thick black line, is generated by a backward simulator. At iteration r + 1 of 
the PG-BSi sampler we will run a CAPF conditioned on xi;T{r), generating the 
particles shown in Figure |3] (right). When we again apply a backward simulator 
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Figure 3: Particles generated by the CAPF at iterations r (left) and r + 1 
(right) of the PG-BSi sampler. The dots show the particle positions and the 
thick black lines show the sampled backward trajectories xi-x^r) and xi^xir+l), 
respectively. Note that all the particles at all time steps are reachable by the 
backward simulator. 
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Figure 4: Running means for the estimated standard deviations and for 
the PMMH sampler (left) and MwPG sampler (right) for a maximum of 30000 
MCMC iterations. The four curves correspond to different number of particles. 
The "true" values are shown as thick solid lines. Note that the estimate of 
ay does not converge to this "true" value, but this is not surprising since we 
consider just one data realisation. Hence, the posterior mean may very well 
differ from the "true" value. 



to this collection of particles we will, with high probability, sample a trajectory 
xiixif + 1) which is entirely different from xi-T^r)- This leads to a much better 
exploration of the state space and thus a faster mixing Gibbs kernel. 



5 Metropolis within particle Gibbs 
5.1 The MwPG sampler 

Quite often, it is not possible to sample exactly from the conditional parameter 
density p{9 \ xi-TtVi-.t)- If this is the case, we can replace step 2(a) in Algo- 
rithm [3] with an MH update. This results in what we refer to as a Metropolis 
within particle Gibbs (MwPG) sampler. The name stems from the commonly 
used term Metropolis-within- Gibbs, which refers to a hybrid MH/Gibbs sampler 



(see 24 25 ). Hence, let us assume that x[.rp is a fixed trajectory and consider 
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Algorithm 4 MwPG - Metropolis within particle Gibbs 

1. Initialise: Set 0{0), xi;t{0) and ji:T(0) arbitrarily. 

2. For r > 1, iterate: 

(a) MH step for sampling a parameter: 

• Sample 61" ~ q{9 \ e{r - l),xi.,T{r - I)). 

• Compute p = p{0", 6{r — l),xi;T{r — 1)) according to 

• With probability p, set 9{r) ~ 9" , otherwise set 9{r) — 9{r — 1). 

(b) Rmi a CAPF targeting pg(^r){xi-T I Ui-t), conditioned on {xi-rir — 
l),Ji:T(r-l)}. 

(c) Run a backward simulator to generate ji:T{f)- Set xi:T{r) to the corre- 
sponding particle trajectory. 



the problem of sampling from p{9 \ x[.rp^ Ui-.t) using MH. We thus choose some 
proposal density q(9 \ 9' ^x'^.j^), which may depend on the previous parameter 
9' and the state trajectory x'^.j,. The proposal can also depend on the fixed 
measurement sequence yi-.Ti but we shall not make that dependence explicit. 
We then generate a sample 9" ^ q{9 \ 9' , x'l.rp) and accept this with probability 

which is a standard MH update (see e.g. [l3]). The acceptance probability can 
be computed directly from the quantities defining the model ([T]) since, 

T T-1 

p{9 I xi..T,yi:T) oc J|5e(yt | xt) J]^ feixt+i I xt)'ng{xi)p{9). 
t=i t=i 

By plugging this MH step into the PG-BSi sampler we obtain the MwPG 
method, presented in Algorithm |4] Due to the fact that the MH step will leave 
the target density invariant, the stationary distribution of the MwPG sampler is 
the same as for the PG-BSi sampler. MwPG can be seen as an alternative to the 
PMMH sampler by [l]. Note that, as opposed to case of PMMH, the acceptance 
probability ^ does not depend explicitly on the likelihood pijji-.r \ 9). 

5.2 Numerical illustration — estimating variances 



Let us return to the simulation example studied in Section [4.2| We use the same 
batch of data 2/i:5oo: generated from model Q with a1 ~ 10 and cTg — 1. As 
before, we wish to estimate the noise variances and set 9 — {a^,a^}, but now 
we apply the PMMH sampler of |l] and the MwPG sampler in Algorithm |4] 
As proposal kernel for the parameter, we use a Gaussian random walk with 
standard deviation 0.15 for tr^ and 0.08 for (both methods use the same 
proposal). 

Running means for the estimated standard deviations, for both methods, 
are shown in Figure |4] The difference between MwPG and PMMH is similar to 
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that between PG-BSi and PG, especially when we consider the effect of using 
few particles. Compared to the PG-BSi sampler operating on the same data 
(see Figure [l] and note the different scaling of the axes) the MwPG sampler is 
slower to converge. However, this is expected, since the latter does not make 
use of the conjugacy of the priors. 

In Figure [5] we show scatter plots for the Markov chains after 50000 itera- 
tions, using a burnin of 10000 iterations. From these plots, it is clear that the 
estimated posterior parameter distribution for the MwPG sampler is rather in- 
sensitive to the number of particles used, and similar to what we get from PMMH 
using N — 5000. The same effect can be seen by analysing the autocorrelation 
functions, shown in Figure [6] Finally, the average acceptance probabilities for 
the two methods and for different number of particles are given in Table [T] 

It is worth to note that, in the limit TV — >■ oo, the PMMH will "converge 
to" an idealised, marginal MH sampler. The MwPG, on the other hand, "con- 
verges to" an idealised Metropolis within Gibbs sampler. Hence, for a very large 
number of particles, we expect that PMMH should outperform MwPG. Further- 
more, and more importantly, whether or not MwPG is preferable over PMMH 
should be very much problem dependent. If there is a strong dependence be- 
tween the states and the parameters, we expect that the MwPG sampler should 
suffer from poor mixing, since it samples the parameters conditioned on the 
states, and vice versa. In such cases, the PMMH might perform better, since it 
more closely resembles a marginal MH sampler. 



□ 




□ 
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Figure 5: Scatter plots for {a^^Oe) 
for PMMH (left column) and MwPG 
(right column). From top to bot- 
tom; = 5, iV = 100, N = 1000, 
N = 5000. PMMH using iV = 5 and 
= 100 does not find the correct re- 
gion of high posterior probability, and 
the samples lie outside the axes. 
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Figure 6: Autocorrelation plots for 
CT„ (top row) and Ce (bottom row) 
for PMMH (left column) and MwPG 
(right column). There is a sharp drop 
in autocorrelation for MwPG for any 
number of particles. 



5.3 Numerical illustration — estimating all parameters 



So far, we have focused on estimating the noise variances in the model 

suggested to identify all the parame 
as a more challenging test of PMCMC. 



However, Johannes, Poison and Yae 26 suggested to identify all the parameters 
33, a, (J^, (Tg} in model 

4.2I but with (7.^ = 1 and 



They use the same parameter values as in Section 
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Table 1: Average acceptance probabilities 



N 5 100 800 2500 7000 10000 

PMMH 1.4 • lO-'^ 8.6 • 10-3 0.082 0.27 0.45 0.52 
MwPG 0.62 0.62 0.61 0.61 0.61 0.61 




0.48 0.5 0.52 25 28 31 




7.5 8 8.5 9 1.9 1.95 2 2.05 




0.7 0.85 1 1.15 3 3.15 3.3 



Figure 7: Posterior densities for the parameters of model ([8| . The true values 
are marked by vertical dashed lines. 

cTg = 10, and apply an alternative, SMC based method with 300000 particles 
to estimate the parameters. Here, we address the same problem with MwPG 
using iV = 5 particles and 60000 MCMC iterations. 

We use vague normal distributed priors for /32 and P^, a uniform prior 
over [1, 3] for a and inverse gamma priors with a = b = 0.01 for and a^. 
To put even more pressure on the method, Andrieu, Doucet and Holenstein 
[l] p. 338] suggested to use a larger data set. Therefore, we use T — 2000 
samples, generated from the model ([s]) and apply the MwPG sampler. To 
propose values for a, we use a Gaussian random walk with standard deviation 
0.02, constrained to the interval [1, 3]. For the remaining parameters, we make 
use of conjugacy of the priors and sample exactly from the conditional posterior 
densities. Figure [7] show the estimated, marginal posterior parameter densities 
for the six parameters, computed after a burnin of 10000 iterations. The method 
appears to do a good job at finding the posterior density, even in this challenging 
scenario and using only 5 particles. 

6 PG-BSi for joint smoothing 
6.1 Collapsed PG-BSi 

Assume now that the parameter 9 is known and that our objective is to solve 
the smoothing problem for the model ([I]). That is, we seek the joint smoothing 
density pg(xi-T \ Ui-.t) or some marginal thereof. Several particle based methods 
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have been presented in the hterature, devoted to this problem, see e.g. 14 - 17 
Many of these have quadratic complexity in the number of particles. Though, 
quite recently some advancements have been made towards linear complexity, 
see [T6[[T7]. In [l], the PIMH sampler is suggested as a PMCMC method tar- 



geting the joint smoothing density. The benefit of this is that the stationary 
distribution of the PIMH is the exact smoothing distribution, removing any 
bias that normally is present in more "traditional" particle smoothers. How- 
ever, if we run the PIMH for R iterations using N particles, the computational 
complexity is 0{NR). In order to obtain a reasonable acceptance probability 
N needs to be fairly high, rendering the method computationally demanding. 
It is desirable to find a similar method that works even with a low number of 
particles. 

The alternative that we propose in this paper is very straightforward. We 
simply run a PG-BSi according to Algorithm [3j but since 9 is fixed we skip 
step 2(a). This collapsed PG-BSi sampler will, similarly to PIMH, generate a 
Markov chain whose stationary distribution is the exact smoothing distribution. 
Though, a major difference between the two methods is that the collapsed PG- 
BSi sampler will accept all the generated particle trajectories, whereas the PIMH 
only will accept a portion of them in an MH fashion. The acceptance probability 
of the PIMH will typically deteriorate as we decrease the number of particles N 
or increase the length of the data record T (see [l| Figure 3]). 

The complexity of the collapsed PG-BSi sampler is still 0{NR), but since 
N is allowed to be fairly small in this case, we believe that this method could 
be a serious competitor to existing particle smoothers. Note that the method 
is exactly linear in the number of generated trajectories R. Furthermore, the 
PMCMC based smoothers are trivially parallelisable, simply by running multiple 
parallel chains. 



6.2 Multiple trajectories 

One possible extension, to further improve the performance of the collapsed PG- 
BSi, is to use multiple trajectories as suggested for the PIMH sampler in [21] . 
Assume that we wish to estimate E[h{xi;T) \ Ui-.t] for some function h and that 
we run R iterations of the PG-BSi sampler (possibly with some burnin) . The 
most straightforward estimator of h is then 

r— 1 

An alternative is to use the Rao-Blackwellised estimator 

1 ^ 

^RB = ;^ 51 ^ i^i-.Tir)) I Xl:T(r), i2:T(r)] . 

7' — 1 

To compute the expectation in this expression, we would have sum over all 
possible backward trajectories. Since there are N'^ possible trajectories, this is 
in general not feasible, unless h is constrained to be a function of only some small 
subset of the a;:s, e.g. a single Xt or a pair {xt,Xt+i}. However, a compromise 
would be to generate multiple backward trajectories in the backward simulator, 
providing an estimate of the above expectation. Hence, at iteration r of the 
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PG-BSi sampler, instead of generating just a single backward trajectory as in 
Algorithmjsj we generate M trajectories, for to = 1, . . . , M and use the 

estimator 



R M 
r— 1 m— 1 



See 21 for a more in-depth discussion on using multiple trajectories in the 
context of PIMH. Since the backward simulator generates conditionally inde- 
pendent and identically distributed (i.i.d.) samples (given {xi-^, i2:T}), we can 
set the state of the Markov chain to any of the generated backward trajectories, 
e.g. Xi..T{r) = x\.T{r). 



6.3 Numerical illustration 

To illustrate the applicability of the collapsed PG-BSi for joint smoothing, we 
again consider the same batch of T = 500 samples from the model ([s]) as used in 
Section |42] and Section [O] Wc apply the PIMH, the collapsed PG-BSi and a 
state of the art particle smoother, known as fast forward filter/backward simu- 
lator (fast FFBSi) derived in [16j. The PMCMC samplers are run using = 20 
particles for AI = 1000 iterations. The fast FFBSi is run using N — 1000 parti- 
cles and backward trajectories. To obtain something that we can visualise, let 
us consider the marginal smoothing density at time t = 151, i.e. peixi^i \ yi-.r)- 
Figure [s] shows the corresponding cumulative distribution functiorjj (CDF), as 
well as the empirical CDFs from the three methods. 

One question of interest is if it is possible to avoid burnin, by using a good 
initialisation of the MCMC sampler. In the example above we did not use any 
burnin and the chain was initialised by a run of an (unconditional) PF using N = 
20 particles, followed by a backward simulation. Another heuristic approach is 
to estimate the joint smoothing distribution by running a "standard" particle 
smoother, e.g. the fast FFBSi, with a moderate number of particles. We can 
then initiate the collapsed PG-BSi sampler by sampling from the estimated joint 
smoothing distribution, which is hopefully close to the stationary distribution. 
This heuristic can be of particular interest it we are able to run several parallel 
chains. We can then apply a fast FFBSi to generate M backward trajectories, 
which are conditionally i.i.d. samples from the empirical smoothing distribution, 
to initiate M independent chains. 



7 Epidemiological model 

As a final numerical illustration, we consider the identification of an epidemio- 
logical model using the MwPG sampler, based on search engine query observa- 
tions. Seasonal influenza epidemics each year cause millions of severe illnesses 



and hundreds of thousands of deaths world-wide 27 . Furthermore, new strains 
of influenza viruses can possibly cause pandemics with very severe effects on 
the public health. The ability to accurately predict disease activity can enable 
early response to such epidemics, which in turn can reduce their impact. 



^To obtain the "true" CDF, we run a fast FFBSi with 50000 particles and 10000 backward 
trajectories. 
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Figure 8: Empirical CDFs for the fast FFBSi, the collapsed PG-BSi and the 
PIMH. See text for details. 



We consider a susceptible/infected/recovered (SIR) epidemiological model 
with environmental noise and seasonal fluctuations (28j[29] , which is discretised 
according to the Euler-Maruyama method. 



S 

St+dt = St+ iiPdt - fiStdt - (1 + vf) /3{t)-^Itdt, 
It+dt = /t - (7 + ^^)Itdt + (1 + vl) (3{t)^Itdt, 



R 



■t+dt 



Rt + lltdt — ^Rtdt. 



(10a) 

(10b) 
(10c) 



Here, St, It and Rt represent the number of susceptible, infected and recovered 
individuals at time t (months), respectively. The sampling time dt is chosen as 
1/30 in our experiment, i.e. we sample the model once a day. The state of the 
system is comprised of Xt — [St It) ■ The total population size P — 10^ and 
the host birth/death rate /i — 0.0012 are both assumed known. The seasonally 
varying transmission rate is given by /3{t) — /3(1 + Q!sin(27r<:/12)) and the rate of 
recovery is given by 7. The process noise is assumed to be Gaussian according 
to {vf vif ~ AA(0, F^h/y/dt). 

In [29], the PMMH sampler is used to identify a similar SIR model, though 
with a different observation model than what we consider here (see below). A 
different Monte Carlo strategy, based on a particle filter with an augmented 



state space, for identification of an SIR model is proposed in 30 . Here, we use 
an observation model which is inspired by the Google Flu Trends project 27 . 
The idea is to use the frequency of influenza related search engine queries to 
infer knowledge about the dynamics of the epidemic 



27 



we 



As proposed by 

use a linear relationship between the observations and the log-odds of infected 
individuals, i.e. 



yt = plog 



It 
P-It 



et. 



'AA(0 



(lOd) 
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Figure 9: Posterior densities for the parameters of model (10). The true values 
are marked by vertical dashed lines. 



The parameters of the model are 9 = {7, /3, a, F, p, r}, with the true values 
given by 7 = 3, /3 = 30.012, a = 0.16, F = 0.03 , p = 1.1 and t ^ 0.224. 
We place a normal-inverse-gamma prior on the pair {p, r^}, i.e. p{p,t'^) — 
N{p] ^p,CpT'^)IQ{T'^;ar,hr)- The hyperparameters are chosen as /ip = 1, Cp — 
0.5 and ar = for = 0.01. We place an inverse gamma prior on F"^ with hyper- 
parameters ap — bp — 0.01 and a normal prior on 7 with ji^ — "i and a'i — 1. 



The above mentioned priors are conjugate to the model defined by (10), which 



is exploited in the MwPG sampler by drawing values from their conditional 
posteriors. For the parameters /3 and a we use flat improper priors over the 
positive real line and sample new values according to MH moves, as described 
in Section [5] Samples are proposed independently for j3 and a, according to 
Gaussian random walks with variances 5 • 10"'^ and 5 • 10~^, respectively. 

We generate 8 years of data with daily observations. The number of infected 



individuals It over this time period is shown in Figure 10 The first half of the 



data batch is used for estimation of the model parameters. For this cause, we 
employ the MwPG sampler using iV = 20 particles for 100000 MCMC itera- 
tions. Histograms representing the estimated posterior parameter distributions 
are shown in Figure [9j We notice some raggedness in some of the estimated 
marginals. This implies that they have not fully converged, suggesting a fairly 
slow mixing of the Markov chain. Still, the estimated distributions seem to 
provide accurate information about the general shapes and locations of the pos- 
terior parameters distributions, and the true parameters fall well within the 
credible regions of the posteriors. 

Finally, the estimated model is used to make 1 month ahead predictions of 



the disease activity for the subsequent 4 years, as shown in Figure 10 The 
predictions are made by Monte Carlo sampling from the posterior parameter 
distribution, followed by a run of a particle filter on the validation data, using 
100 particles. As can be seen, we obtain an accurate prediction of the disease 
activity, which falls within the estimated 95 % credibility intervals, one month 
in advance. 
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Figure 10: Disease activity (number of infected individualt If) over an 8 year 
period. The first 4 years are used as estimation data, to find the unknown pa- 
rameters of the model. For the consecutive 4 years, one month ahead predictions 
are computed using the estimated model. 

8 Conclusions and future work 



In the original PMCMC paper, Andrieu, Doucet and Holenstein writes "PM- 
CMC sampling is much more robust and less likely to suffer from this deple- 
tion problem [referring to the PF]. This stems from the fact that PMCMC 
methods do not require SMC algorithms to provide a reliable approximation of 
Pe{xi:T I Ui-.t), but only to return a single sample approximately distributed 
according to pq{xi-t \ yi-.r)"- 

This observation is the very essence of the PMCMC methods and the reason 
for why they have shown to be such powerful tools for joint Bayesian state and 
parameter inference. In this paper we have argued that the use of backward 
simulation in PMCMC make this claim even stronger. Three methods have been 
presented, each similar to one of the methods derived in [l]. These are; PG-BSi 
(similar to PG), MwPG (similar to PMMH) and collapsed PG-BSi (similar to 
PIMH). 

We have shown experimentally that all of these methods work properly, 
even when we use very few particles in the underlying PFs. This robustness 
to a decreased number of particles stems from a simple modification of the PG 
sampler, namely to add a backward simulation sweep to reduce the correlation 
between the particle trajectories sampled at any two consecutive iterations. 
This modification does not only enable the application of PMCMC methods 
with very few particles, but will also allow us to treat larger data sets. 

It would be interesting to analyse how one should choose the number of 
particles. The mixing of the MCMC kernels will increase as we increase N, but 
the improvement seems to saturate rather quickly (see e.g. Figure [6]) . Based 
on this, we believe that it in general is better to use fewer particles and more 
MCMC iterations. However, deducing to what extent this statement is problem 
dependent is a topic for future work. 

Another interesting avenue for future work is regarding theoretical aspects 
of the mixing of the PG-BSi sampler. That is, whether or not it is possible 
to show that PG-BSi always will mix faster than PG for the same number of 
particles. We believe that this is the case, and some insight into the matter 
can be gained by considering the maximal correlations for the two methods (see 



e.g. 31 for a general discussion) 
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When it comes to a comparison between the MwPG and the PMMH sam- 
plers, it is not as clear to deduce which method that is preferable over the other. 
In fact, we believe that this is very much problem dependent, and that the two 
methods are complementary. If there is a strong dependence between the states 
and the parameters, the MwPG sampler should suffer from poor mixing, since 
it samples these groups of variables conditioned on each other. Basically, this 
is the case if there is little noise in the system, or more severely, if the model is 
degenerate (e.g. if noise only enters on some of the states). In the latter case, 
the method might not even converge, since it is not clear that Assumption [2] is 
fulfilled. In such cases, the PMMH sampler might be a better choice. On the 
other hand, from the numerical examples that we have considered, it is clear 
that the MwPG sampler can substantially outperform the PMMH sampler in 
other cases. It would be of great interest to conduct a more in-depth comparison 
between these two methods, to provide some general guidelines for when to use 
which method. 



A. 2 Proof of Proposition [3] 

We start by noting that, from the definition of the weight function we have 
for any fc e {1, . . . , N}, 



A Proofs 



A.l Proof of Proposition [2] 

Proposition [2] follows directly from Proposition [l] and the fact that 
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Furthermore, it holds that 
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Using the above results, we can write 



0(jl:T I 6',Xl:T,i2:T) 



N 



N 



pei^tr I yi.T) n X n n ^^'i^t 



m—1 



V 



m—l 



Wt-i/e(a;f | 



N 



T-1 / N 



^pe{x'^:^ii I yi:T-i) n X n I n ^^'(c,^r 



m— 1 



t=2 \ r?j.= l 



X w: 



By repeating the above procedure we get 



which completes the proof. 



A. 3 Proof of Theorem [T] 

We have already assessed that is a stationary distribution of the PG-BSi sam- 
pler, (^-irreducibility and aperiodicity of the PG-BSi sampler under Assump- 
tion IT] and Assumption [2] can be assessed analogously as for the PG sampler, 
see [l] Theorem 5] . Convergence in total variation then follows from ^32j Theo- 
rem 5]. 
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